However, if the conditions in Eq. (3.148) are met, then together with the conclusion of Eq. (3.145) the gradient of \(\hat{w}^{t+1}\) is formulated as:
\[
\frac{\partial \mathcal{L}}{\partial \hat{w}^{t+1}} = \frac{\partial \mathcal{L}}{\partial \hat{w}^{t}} - \eta\,\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} \geq \gamma,
\qquad
\eta\,\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} \leq \frac{\partial \mathcal{L}}{\partial \hat{w}^{t}} - \gamma \leq -2\gamma.
\tag{3.149}
\]
Note that \(\eta\) and \(\gamma\) are both positive, so the second-order gradient \(\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} < 0\) always holds. Consequently, \(\mathcal{L}(\hat{w}^{t+1})\) can only be a local maximum rather than a minimum, which contradicts the convergence of the training process. This contradiction indicates that, owing to the additional term \(S(\alpha, w)\), the training algorithm converges once the frequent oscillation stops. Therefore, we complete the proof.
□
Our proposition and proof reveal that the balanced parameter \(\gamma\) acts as a “threshold”: a threshold that is too small fails to mitigate the frequent oscillation effectively, while one that is too large suppresses necessary sign inversions and hinders the gradient descent process.
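To make this gating behavior concrete, the following is a minimal sketch (not the method itself), assuming the condition implied by Eq. (3.148) amounts to accepting a sign inversion of a latent weight only when its gradient magnitude exceeds \(\gamma\); the function name and learning rate below are hypothetical.

```python
import torch

def gated_sign_flip(w_hat, grad, gamma, lr=0.01):
    """Toy illustration of gamma acting as a flip "threshold".

    Hypothetical helper: a gradient step on the latent weights w_hat is kept
    as-is unless it would flip a weight's sign while the gradient magnitude is
    no larger than gamma, in which case that weight is left unchanged.
    """
    update = -lr * grad
    would_flip = torch.sign(w_hat + update) != torch.sign(w_hat)
    strong_grad = grad.abs() > gamma
    # Suppress sign inversions that are not backed by a sufficiently large gradient.
    update = torch.where(would_flip & ~strong_grad, torch.zeros_like(update), update)
    return w_hat + update
```

Under this view, a single fixed \(\gamma\) must trade off oscillation damping against the freedom to flip signs.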
To solve this, we devise the learning rule of γ as:
\[
\gamma_i^{n,t+1} = \frac{1}{M^n}\left\| \mathbf{b}_{w_i}^{n,t} \circledast \mathbf{b}_{w_i}^{n,t+1} - 1 \right\|_0 \cdot \max_{1 \le j \le M^n}\left( \left| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \right| \right),
\tag{3.150}
\]
where the first factor \(\frac{1}{M^n}\big\| \mathbf{b}_{w_i}^{n,t} \circledast \mathbf{b}_{w_i}^{n,t+1} - 1 \big\|_0\) denotes the proportion of weights whose sign changes, and the second factor \(\max_{1 \le j \le M^n}\big( \big| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \big| \big)\), derived from Eq. (3.148), is the gradient with the greatest magnitude at the \(t\)-th iteration. In this way, we suppress frequent weight oscillation with a resilient gradient.
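For clarity, the rule in Eq. (3.150) can be sketched in a few lines of PyTorch; the tensor shapes, the per-channel layout, and the function name below are assumptions made for illustration.

```python
import torch

def update_gamma(b_w_prev, b_w_curr, grad_w_hat):
    """Sketch of Eq. (3.150) for one layer n.

    b_w_prev, b_w_curr : binarized weights (+1/-1) at iterations t and t+1,
                         shape (C_out, M^n), one row per output channel i.
    grad_w_hat         : gradient of the loss w.r.t. the latent weights at
                         iteration t, same shape.
    Returns a per-channel gamma of shape (C_out,).
    """
    M = b_w_prev.shape[1]
    # b^t * b^{t+1} is -1 exactly where a sign flipped, so the L0 norm of
    # (b^t * b^{t+1} - 1) counts the flipped weights in each channel.
    flip_ratio = (b_w_prev * b_w_curr - 1.0).ne(0).float().sum(dim=1) / M
    # Largest intra-channel gradient magnitude of the t-th iteration.
    max_grad = grad_w_hat.abs().amax(dim=1)
    return flip_ratio * max_grad
```

Computing \(\gamma\) per output channel keeps the threshold commensurate with that channel's own gradient scale, matching the layer- and channel-wise variation discussed in the ablation study below.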
We further optimize the scaling factor as follows:
\[
\delta_{\alpha_i^n} = \frac{\partial \mathcal{L}}{\partial \alpha_i^n} + \frac{\partial \mathcal{L}_R}{\partial \alpha_i^n}.
\tag{3.151}
\]
The gradient derived from the softmax loss can be calculated directly by backpropagation. Based on Eq. (6.88), it is easy to derive:
\[
\frac{\partial \mathcal{L}_R}{\partial \alpha_i^n} = \gamma_i^n \left( w_i^n - \alpha_i^n \mathbf{b}_{w_i}^n \right) \circledast \mathbf{b}_{w_i}^n.
\tag{3.152}
\]
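As a rough sketch, the scaling-factor update of Eqs. (3.151)–(3.152) might be implemented as follows; the reduction of \(\circledast\) over the channel dimension and all names and shapes are assumptions made for illustration.

```python
import torch

def scaling_factor_grad(w, alpha, b_w, gamma, grad_alpha_from_loss):
    """Sketch of Eqs. (3.151)-(3.152) for the per-channel scaling factors.

    w                    : latent real-valued weights, shape (C_out, M^n)
    alpha                : scaling factors, shape (C_out,)
    b_w                  : binarized weights (+1/-1), shape (C_out, M^n)
    gamma                : balanced parameters, shape (C_out,)
    grad_alpha_from_loss : dL/d(alpha) from ordinary backpropagation, shape (C_out,)
    """
    # Eq. (3.152): dL_R/d(alpha_i) = gamma_i * (w_i - alpha_i * b_w_i) (*) b_w_i,
    # where the element-wise product is assumed to be reduced over the channel.
    grad_from_reg = gamma * ((w - alpha.unsqueeze(1) * b_w) * b_w).sum(dim=1)
    # Eq. (3.151): combine the two gradient terms.
    return grad_alpha_from_loss + grad_from_reg
```

The returned \(\delta_{\alpha}\) can then be fed to any standard optimizer step for the scaling factors.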
3.9.3 Ablation Study
Since our ReBNN does not introduce additional hyperparameters, we first evaluate different ways of calculating \(\gamma\). Then we show how ReBNN achieves a resilient training process. In the ablation study, we used a ResNet-18 backbone initialized from the first-stage training with W32A1, following [158].
Calculation of γ: We compare different calculations of \(\gamma\) in this part. As shown in Table 3.7, the performance first increases and then decreases as the value of the constant \(\gamma\) grows. Considering that the magnitude of the gradient varies across both layers and channels, a suitable \(\gamma\) can hardly be set manually as a single global value. We further compare the gradient-based calculation. As shown in the bottom rows, we first use \(\max_{1 \le j \le M^n}\big( \big| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \big| \big)\), the maximum intra-channel gradient of the last iteration, which performs similarly to the constant 1e−4. This indicates that only using the maximum intra-channel gradient may suppress necessary